ENH: Add orient=tight format for dictionaries #35292

Dr-Irv · 2020-07-15T17:36:26Z

xref to_json of df with multi-index header, followed by read_json results in incorrectly formatted column headers #4889 (see below)
tests added / passed
- tests.frame.methods.test_to_dict.test_to_dict_orient_tight()
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

The issue in #4889 contains examples of how the JSON format doesn't support MultiIndex for the index or the columns, independent of the orient chosen. Also, none of the orientations support index names (or MultiIndex names). Finally, if you want to have a JSON format that is "tight" (or "compact"), the split is the closest thing, but it is incomplete due to the indexing issues.

As a first step to addressing this, I have created a orient='tight' option to DataFrame.to_dict() and DataFrame.from_dict() . If we agree that this is a reasonable representation, the next step (which I'd prefer to do in a second PR) would be to also support it with JSON. The challenge there is that to_json() is currently implemented in C. Right now, one can just use json.dumps(df.to_dict(orient='tight')) to get the needed JSON as a workaround.

I'm open to changing the word 'tight' to something else, but I wanted it to start with a different letter than the other orientations.

WillAyd · 2020-07-21T19:20:38Z

Hmm +/- 0 on adding anything to to_dict here; we already have a lot of formats that kind of do the same things but not really and it's not always spelled out. While this could have some utility I think it adds to that confusion

The challenge there is that to_json() is currently implemented in C.

If the ultimate goal is to include in the table schema most of that is written in Python, so maybe not as much of a concern?

Dr-Irv · 2020-07-21T19:25:45Z

Hmm +/- 0 on adding anything to to_dict here; we already have a lot of formats that kind of do the same things but not really and it's not always spelled out. While this could have some utility I think it adds to that confusion

The challenge there is that to_json() is currently implemented in C.

If the ultimate goal is to include in the table schema most of that is written in Python, so maybe not as much of a concern?

@WillAyd Here are two important comments in the original issue that this PR addresses:

#4889 (comment) states (and I agree with this observation) that the table format is not compact when it comes to JSON. The size of the 'table' format is not as compact as possible when you need to push JSON around as the payload in a REST API.

#4889 (comment) states the issue with not preserving the index correctly.

jreback

some comments, looks pretty good if you'd update the notes to 1.2

jreback · 2020-09-24T01:44:34Z

pandas/core/frame.py

+        else:
+            realdata = data["data"]
+
+            def create_index(indexlist, namelist):


I think ensure_index does this

@Dr-Irv can you review this comment here

@jreback ensure_index doesn't handle the names of the indexes (or columns), so you'd still have logic to handle that part anyway.

does this work
e..g just use Index

In [98]: pd.Index([(1, 0), (1, 1)], names=['foo', 'bar']) Out[98]: MultiIndex([(1, 0), (1, 1)], names=['foo', 'bar']) In [99]: pd.Index([1, 2], names=['foo']) Out[99]: Int64Index([1, 2], dtype='int64')

No, because note that the argument to set the names of a MultiIndex is names, while for an Index it is name . In your example in cell 99, you lost the name of the index.

grr this is a bug, actually is this still true on master (this was a pretty old version i am using)

Still an issue. Using pd.Index, with names, you get a FutureWarning. With name, you get a stack trace.

In [1]: import pandas as pd In [2]: pd.__version__ Out[2]: '1.4.0.dev0+905.gdace93d694' In [3]: pd.Index([(1,0), (1,1)], names=["foo", "bar"]) <ipython-input-3-99fe0b033485>:1: FutureWarning: Passing keywords other than 'data', 'dtype', 'copy', 'name', 'tupleize_cols' is deprecated and will raise TypeError in a future version. Use the specific Index subclass directly instead. pd.Index([(1,0), (1,1)], names=["foo", "bar"]) Out[3]: MultiIndex([(1, 0), (1, 1)], names=['foo', 'bar']) In [4]: pd.Index([(1,0), (1,1)], name=["foo", "bar"]) --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-4-138c3fe61553> in <module> ----> 1 pd.Index([(1,0), (1,1)], name=["foo", "bar"]) c:\Code\pandas_dev\pandas\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs) 403 from pandas.core.indexes.range import RangeIndex 404 --> 405 name = maybe_extract_name(name, data, cls) 406 407 if dtype is not None: c:\Code\pandas_dev\pandas\pandas\core\indexes\base.py in maybe_extract_name(name, obj, cls) 6876 # GH#29069 6877 if not is_hashable(name): -> 6878 raise TypeError(f"{cls.__name__}.name must be a hashable type") 6879 6880 return name TypeError: Index.name must be a hashable type

nvm we removed names=. i still find this a bit of a footgun.

ok i think we need to handle this a bit bitter if you wouldn't mind opening an issue.

@jreback Seems like we ought to have another issue with respect to pd.Index([(1,0), (1,1)], name=["foo", "bar"]) not returning a MultiIndex ??

Dr-Irv · 2020-09-25T18:03:09Z

@jreback before I make the change you suggested, when we had the dev sprint on 9/4/20, we discussed this PR, and it was suggested that I add an argument such as preserve_names to to_dict() that would apply to the "split" format rather than introduce the "tight" option.

So what is your opinion about introducing orient="tight" versus orient="split", preserve_names=False (where the latter would be used to preserve compatibility with prior versions)?

simonjayhawkins · 2021-03-14T15:12:09Z

@Dr-Irv what's the status here?

Dr-Irv · 2021-03-14T15:15:30Z

@Dr-Irv what's the status here?

I've been waiting for @jreback to respond to this comment: #35292 (comment)

I would like to introduce orient="tight", but I'm not sure there is agreement on that. If we can agree, I'll move forward with it.

jreback · 2021-10-04T00:12:06Z

looks ok, can you merge master and will look

Dr-Irv · 2021-10-04T16:01:12Z

@jreback have merged with master and all checks passed. Changed it from draft status.

jreback

looks good a comment

jreback · 2021-10-16T14:48:58Z

pandas/core/frame.py

+        else:
+            realdata = data["data"]
+
+            def create_index(indexlist, namelist):


@Dr-Irv can you review this comment here

jreback · 2021-10-16T15:50:57Z

thanks @Dr-Irv

Dr-Irv added 6 commits July 15, 2020 13:28

Add orient=tight format for dictionaries

bed9772

fix up documentation formatting

520f9c5

code formatting in docs

e46d5ac

fix indentation in docstrings

06c3f2a

add period to docstring for from_dict orient

d8e68f9

add blank line in docstring

1ef32fc

WillAyd added IO Data IO issues that don't fit into a more specific label Needs Discussion Requires discussion from core team before further action labels Jul 21, 2020

simonjayhawkins mentioned this pull request Sep 15, 2020

CI: Add stale PR action #36336

Merged

Dr-Irv marked this pull request as draft September 22, 2020 14:09

jreback requested changes Sep 24, 2020

View reviewed changes

Merge branch 'master' into issue4889

344687f

WIP on adding to_dict

f7998d3

Dr-Irv added 2 commits October 4, 2021 09:39

Merge with master

038eff5

change versionchanged to 1.4.0. Fix typing issue, update API calls

77fec12

Dr-Irv marked this pull request as ready for review October 4, 2021 13:53

Merge branch 'master' into issue4889

dace93d

jreback requested changes Oct 16, 2021

View reviewed changes

jreback approved these changes Oct 16, 2021

View reviewed changes

jreback merged commit 7f2b418 into pandas-dev:master Oct 16, 2021

Dr-Irv mentioned this pull request Oct 16, 2021

BUG: Creating a MultiIndex using pd.Index and the name argument is now broken (worked in 1.2.5 with names) #44052

Open

3 tasks

Dr-Irv deleted the issue4889 branch February 13, 2023 20:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add orient=tight format for dictionaries #35292

ENH: Add orient=tight format for dictionaries #35292

Dr-Irv commented Jul 15, 2020 •

edited by jreback

Loading

WillAyd commented Jul 21, 2020

Dr-Irv commented Jul 21, 2020

jreback left a comment

jreback Sep 24, 2020

jreback Oct 16, 2021

Dr-Irv Oct 16, 2021

jreback Oct 16, 2021 •

edited

Loading

Dr-Irv Oct 16, 2021

jreback Oct 16, 2021

Dr-Irv Oct 16, 2021

jreback Oct 16, 2021

jreback Oct 16, 2021

Dr-Irv Oct 16, 2021

Dr-Irv commented Sep 25, 2020

simonjayhawkins commented Mar 14, 2021

Dr-Irv commented Mar 14, 2021

jreback commented Oct 4, 2021

Dr-Irv commented Oct 4, 2021

jreback left a comment

jreback Oct 16, 2021

jreback commented Oct 16, 2021

ENH: Add orient=tight format for dictionaries #35292

ENH: Add orient=tight format for dictionaries #35292

Conversation

Dr-Irv commented Jul 15, 2020 • edited by jreback Loading

WillAyd commented Jul 21, 2020

Dr-Irv commented Jul 21, 2020

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback Oct 16, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv commented Sep 25, 2020

simonjayhawkins commented Mar 14, 2021

Dr-Irv commented Mar 14, 2021

jreback commented Oct 4, 2021

Dr-Irv commented Oct 4, 2021

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Oct 16, 2021

Dr-Irv commented Jul 15, 2020 •

edited by jreback

Loading

jreback Oct 16, 2021 •

edited

Loading